LOCAL PROPER SCORING RULES OF ORDER TWO
and University of Heidelberg

Scoring rules assess the quality of probabilistic forecasts, by as- 
signing a numerical score based on the predictive distribution and on 
the event or value that materializes. A scoring rule is proper if it en- 
courages truthful reporting. It is local of order k if the score depends 
on the predictive density only through its value and the values of its 
derivatives of order up to k at the realizing event. Complementing 
fundamental recent work by Parry, Dawid and Lauritzen, we char- 
acterize the local proper scoring rules of order 2 relative to a broad 
class of Lebesgue densities on the real line, using a different approach. 
In a data example, we use local and nonlocal proper scoring rules to 
assess statistically postprocessed ensemble weather forecasts. 

1. Introduction. One of the major purposes of statistical analysis is to 
make forecasts for the future, and to provide suitable measures of the un- 
certainty associated with them. Consequently, forecasts ought to be prob- 
abilistic in nature, taking the form of probability distributions over future 
quantities and events [Dawid (1984)]. Scoring rules provide summary mea- 
sures for the evaluation of probabilistic forecasts, by assigning a numerical 
score based on the predictive distribution and on the event or value that 
materializes. We take scoring rules to be negatively oriented losses that 
a forecaster wishes to minimize. Specifically, if the forecaster quotes the 
predictive distribution $Q$ and the event $x$ materializes, her loss is $S(x,Q)$.
The function $S(\cdot,Q)$ takes values in the extended real line,
$\bar{\mathbb{R}} = [-\infty,\infty]$, and we write $S(P,Q)$ for the expected
value of $S(\cdot,Q)$ under $P$. Suppose, then, that the forecaster's best
judgment is the predictive distribution $P$. The forecaster has no incentive to
predict any $Q \neq P$, and is encouraged to quote her true belief, $Q = P$, if
$S(P,P) \le S(P,Q)$. A scoring rule with this property is said to be proper
[Gneiting and Raftery (2007)].

Our paper is concerned with local proper scoring rules for probabilistic
forecasts of a real-valued quantity. Briefly, if the predictive distribution is
absolutely continuous, it can be argued that $S(x,Q)$ ought to depend only
on the behavior of the predictive density, $q$, in an infinitesimal neighborhood
of the observation that materializes, $x$. Any such scoring rule is said to be
local, with the logarithmic scoring rule,

\[
S(x,Q) = -\ln q(x), \tag{1}
\]

being the most prominent example [Good (1952)]. Another example is the
Hyvärinen (2005) score,

\[
S(x,Q) = ((\ln q)'(x))^2 + 2(\ln q)''(x), \tag{2}
\]

which is local of order 2, in the sense that it depends on the predictive
density only through its value, and the values of its first and second
derivatives, at the observation. Similarly, the logarithmic score can be
considered to be local of order zero; in fact, it is the only such score that is
proper, up to equivalence [Bernardo (1979)]. The Hyvärinen score is also proper
[Dawid and Lauritzen (2005)], which raises the question of a characterization
of the local proper scoring rules of order $k \le 2$.

In a far-reaching recent paper, Parry, Dawid and Lauritzen (2012) achieve
a characterization of the key local score functions of any order $k \ge 0$. They
derive these scores from the Euler-Lagrange equation of the calculus of
variations, thereby obtaining natural candidates for local proper scoring rules,
the actual propriety of which can be checked by additional criteria. We
complement these results (for more detailed comments, see Remark 3.4) by
developing an alternative approach, restricting ourselves to the practically
most relevant case of the local proper scoring rules of order $k \le 2$. Our main
contributions are the following: we build on a characterization of proper
scoring rules via concave functionals and their (super-)gradients, which yields
the general form of the second-order local proper scoring rules in a natural
tangent construction; and we specify suitable classes of scoring rules and
predictive densities that allow for a full-fledged, rigorous characterization.

The remainder of the paper is organized as follows. Section 2 introduces
the notions of propriety and locality in full detail. Section 3 presents our main
result, in that we characterize the class of the local scoring rules of order 2
that are proper relative to a comprehensive family of Lebesgue densities,
which includes many of the classical location-scale families on the real line.
In addition, we discuss the relations to and distinctions from the work of
Parry, Dawid and Lauritzen (2012). The proof of our main result is given in
Section 4. Section 5 provides supplements and examples, and a data example
on ensemble weather forecasts is given in Section 6. Section 7 closes with
a discussion of open problems and hints at possible future developments and
applications.

2. Local proper scoring rules. Initially, we consider predictive
distributions on a general sample space, $\Omega$. Let $\mathcal{A}$ be
a $\sigma$-algebra of subsets of $\Omega$, and let $\mathcal{M}$ be a class of
probability measures on $(\Omega,\mathcal{A})$. A function on $\Omega$ is
$\mathcal{M}$-quasi-integrable if it is measurable with respect to $\mathcal{A}$
and quasi-integrable with respect to all $Q \in \mathcal{M}$ [Bauer (2001),
page 64]. A probabilistic forecast or a predictive distribution is any
probability measure $Q \in \mathcal{M}$. A scoring rule is any extended
real-valued function $S : \Omega \times \mathcal{M} \to \bar{\mathbb{R}}$ such
that $S(\cdot,Q)$ is $\mathcal{M}$-quasi-integrable for all $Q \in \mathcal{M}$.
Hence, if the predictive distribution is $Q$ and the event $\omega$
materializes, the forecaster's loss is $S(\omega,Q)$. We define

\[
S(P,Q) = \int S(\omega,Q)\,dP(\omega)
\]

as the expected score under $P$ when the probabilistic forecast is $Q$. This
is a well-defined extended real-valued quantity, because $S(\cdot,Q)$ is
quasi-integrable with respect to $P$.

Definition 2.1. The scoring rule $S$ is proper relative to $\mathcal{M}$ if

\[
S(P,P) \le S(P,Q) \quad \text{for all } P,Q \in \mathcal{M}.
\]

It is strictly proper relative to $\mathcal{M}$ if $S(P,P) \le S(P,Q)$ with
equality if and only if $Q = P$.

The term proper was coined by Winkler and Murphy (1968), while the
general idea can be traced to Brier (1950) and Good (1952). Dawid (2008)
provides a concise history of proper scoring rules, which includes major
contributions by the subjective school of probability as well as meteorologists.

A scoring rule can be thought of as local if $S(\omega,Q)$ depends on the
predictive distribution, $Q$, only through its behavior in an infinitesimal
neighborhood of the verifying observation, $\omega$. Bernardo [(1979), page 689]
argued in this vein, noting that "when assessing the worthiness of a scientist's
final conclusions, only the probability he attaches to a small interval
containing the true value should be taken into account." In the context of
predictive densities, the class $\mathcal{M}$ is a family of probability
measures that are absolutely continuous with respect to a $\sigma$-finite
measure $\mu$ on $(\Omega,\mathcal{A})$. We then identify a probabilistic
forecast $Q \in \mathcal{M}$ with its $\mu$-density, $q$, which we call
a predictive density or a density forecast. The classical example of a local
proper scoring rule is the aforementioned logarithmic score, which can be
interpreted as a predictive likelihood, and is strictly proper relative to any
such class $\mathcal{M}$.
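The propriety inequality of Definition 2.1 can be checked numerically for the logarithmic score. The following Python sketch is an illustration only, not part of the paper: it assumes normal predictive distributions, and the helper names (`integrate`, `normal_logpdf`, `expected_log_score`) are hypothetical. It approximates the expected score $S(P,Q)$ by a trapezoid rule and compares it with $S(P,P)$.

```python
import math

def integrate(f, a, b, n=4000):
    """Composite trapezoid rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def normal_logpdf(x, mu, sd):
    """Log density of N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))

def expected_log_score(mu_p, sd_p, mu_q, sd_q):
    """S(P, Q) = E_P[-ln q(X)] for normal P and Q, by quadrature."""
    lo, hi = mu_p - 12.0 * sd_p, mu_p + 12.0 * sd_p
    return integrate(
        lambda x: -normal_logpdf(x, mu_q, sd_q)
        * math.exp(normal_logpdf(x, mu_p, sd_p)),
        lo, hi)

# Forecaster's belief P = N(0, 1); a few arbitrary alternative quotes Q.
alternatives = [(0.5, 1.0), (0.0, 2.0), (-1.0, 0.7)]
s_pp = expected_log_score(0.0, 1.0, 0.0, 1.0)
for mu_q, sd_q in alternatives:
    s_pq = expected_log_score(0.0, 1.0, mu_q, sd_q)
    print(f"Q = N({mu_q}, {sd_q}^2): S(P,Q) - S(P,P) = {s_pq - s_pp:.4f}")
```

Each printed difference is the Kullback-Leibler divergence from $P$ to $Q$, which is why quoting $Q = P$ minimizes the expected loss in this example.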

Hereinafter, we restrict attention to the case in which the sample space
is the real line, $\mathcal{A}$ is the Borel $\sigma$-algebra, $\mu$ is the
Lebesgue measure, and $\mathcal{M}$ corresponds to some class of Borel
probability measures that admit a unique smooth Lebesgue density, $q$.
Accordingly, we will consider $\mathcal{M}$ as a class of densities rather than
measures, and we may write $S(\cdot,q)$. The logarithmic score (1) and the
Hyvärinen score (2) admit particularly simple analytic forms in terms of the
log-likelihood, $\ln q(x)$, and its derivatives, which are fundamental objects
of statistical inference. Therefore, we define locality in terms of these
quantities.

Definition 2.2. Let $k$ be a nonnegative integer, and let $\mathcal{M}$ be
a class of probability densities with respect to the Lebesgue measure on
$\mathbb{R}$ that are everywhere strictly positive and admit derivatives up to
order $k$. A scoring rule $S$ for the class $\mathcal{M}$ then is local of
order $k$ if there exists a function
$s : \mathbb{R}^{2+k} \to \bar{\mathbb{R}}$, which we call a scoring function,
such that

\[
S(x,q) = s(x, \ln q(x), \ldots, (\ln q)^{(k)}(x))
\]

for every $q \in \mathcal{M}$ and $x \in \mathbb{R}$.
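To make Definition 2.2 concrete, the following Python sketch writes the logarithmic score (1) and the Hyvärinen score (2) as scoring functions of the arguments $(x, y_0, y_1, y_2)$, where $y_j$ stands for the $j$th derivative of the log density at $x$. The normal predictive density and all function names are assumptions made for illustration.

```python
import math

def normal_log_derivs(x, mu, sigma):
    """Return (ln q(x), (ln q)'(x), (ln q)''(x)) for a N(mu, sigma^2) density."""
    z = (x - mu) / sigma
    y0 = -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    y1 = -z / sigma
    y2 = -1.0 / sigma ** 2
    return y0, y1, y2

def log_score(x, y0, y1, y2):
    """Scoring function of the logarithmic score: local of order 0."""
    return -y0

def hyvarinen_score(x, y0, y1, y2):
    """Scoring function of the Hyvarinen score: local of order 2."""
    return y1 ** 2 + 2.0 * y2

# Score of the forecast N(0, 1) when the observation x = 1 materializes.
x_obs = 1.0
derivs = normal_log_derivs(x_obs, 0.0, 1.0)
print(log_score(x_obs, *derivs), hyvarinen_score(x_obs, *derivs))
```

Both functions see the predictive density only through its log derivatives at the observation, which is exactly the locality requirement.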

An alternative notion of locality, which allows the predictive density, $q$, to
have zeroes, would take the arguments of the scoring function as
$x, q(x), \ldots, q^{(k)}(x)$. However, in addition to being natural and
facilitating the technicalities, the assumption of strict positivity avoids
pathologies, as will be seen in Remark 3.9 below.

As propriety can only be assessed relative to a specified class of predictive 
densities, we now introduce a suitable family. 

Definition 2.3. Let $\mathcal{P}$ denote the class of all probability
densities, $p$, with respect to the Lebesgue measure on $\mathbb{R}$ that
satisfy the following conditions:

(P1) $p$ is strictly positive on $\mathbb{R}$;

(P2) $p$ admits four continuous derivatives on $\mathbb{R}$;

(P3) for every $m > 0$ and $j = 0, 1, \ldots, 4$,

\[
\lim_{x \to \pm\infty} |x|^m\, p^{(j)}(x) = 0;
\]

(P4) there exists a constant $\alpha = \alpha(p) > 0$ such that

\[
\lim_{x \to \pm\infty} |x|^{-\alpha}\, \frac{p^{(j)}(x)}{p(x)} = 0
\quad \text{for } j = 1, \ldots, 4.
\]

The class $\mathcal{P}$ is quite broad and includes many well-known densities,
such as all normal and logistic densities, the corresponding skew variants
[Genton (2004)], and finite mixtures of these densities. In particular, the
class $\mathcal{P}$ is convex, as implied by the following result.

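Lemma 2.4. For every $k = 1, 2, \ldots$ there exists a polynomial
$M = M(y_1, \ldots, y_k)$ of degree $k$ such that for all
$p, q \in \mathcal{P}$ and $\alpha \in [0,1]$, the density
$r_\alpha = \alpha p + (1-\alpha) q$ satisfies

\[
|(\ln r_\alpha)^{(k)}(x)| \le
M\left( \max\left\{ \left|\frac{p'(x)}{p(x)}\right|, \left|\frac{q'(x)}{q(x)}\right| \right\},
\ldots,
\max\left\{ \left|\frac{p^{(k)}(x)}{p(x)}\right|, \left|\frac{q^{(k)}(x)}{q(x)}\right| \right\} \right)
\]

pointwise in $x \in \mathbb{R}$.

Proof. Let the polynomial $L(y_1, \ldots, y_k)$ of degree $k$ be such that the
$k$th logarithmic derivative of a smooth function $g > 0$ can be written as

\[
(\ln g)^{(k)} = L\left( \frac{g'}{g}, \ldots, \frac{g^{(k)}}{g} \right),
\]

where here and in the following we suppress the argument $x \in \mathbb{R}$.
Define the polynomial $M$ as $L$ with all coefficients replaced by their
absolute values. Evidently then,

\[
|(\ln r_\alpha)^{(k)}| \le
M\left( \left|\frac{r_\alpha'}{r_\alpha}\right|, \ldots, \left|\frac{r_\alpha^{(k)}}{r_\alpha}\right| \right),
\]

and it suffices to show that

\[
\left|\frac{r_\alpha^{(j)}}{r_\alpha}\right| \le
\max\left\{ \left|\frac{p^{(j)}}{p}\right|, \left|\frac{q^{(j)}}{q}\right| \right\}
\quad \text{for } j = 1, \ldots, k.
\]

Consider the function
$f(\alpha) = (\alpha c_1 + (1-\alpha) c_0) / (\alpha d_1 + (1-\alpha) d_0)$,
where $c_0, c_1 \in \mathbb{R}$ and $d_0, d_1 > 0$ are constants. Then

\[
|f(\alpha)| \le \max\{ |f(0)|, |f(1)| \} \quad \text{for } \alpha \in [0,1],
\]

because $f'(\alpha) = (c_1 d_0 - c_0 d_1) / (\alpha d_1 + (1-\alpha) d_0)^2$
does not change sign, so that $f$ is monotone on $[0,1]$. The desired
inequality follows on setting $c_0 = q^{(j)}(x)$, $c_1 = p^{(j)}(x)$,
$d_0 = q(x)$ and $d_1 = p(x)$. □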
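The key step of the proof, the bound on a ratio of two affine functions, can be checked numerically. The sketch below is an illustration under assumed, randomly drawn constants; the function names `ratio` and `bound_holds` are hypothetical.

```python
import random

def ratio(a, c0, c1, d0, d1):
    """f(a) = (a*c1 + (1-a)*c0) / (a*d1 + (1-a)*d0), with d0, d1 > 0."""
    return (a * c1 + (1.0 - a) * c0) / (a * d1 + (1.0 - a) * d0)

def bound_holds(c0, c1, d0, d1, steps=200, tol=1e-12):
    """Check |f(a)| <= max(|f(0)|, |f(1)|) on a grid of a in [0, 1]."""
    cap = max(abs(ratio(0.0, c0, c1, d0, d1)),
              abs(ratio(1.0, c0, c1, d0, d1)))
    return all(abs(ratio(i / steps, c0, c1, d0, d1)) <= cap + tol
               for i in range(steps + 1))

# Arbitrary constants: numerators of either sign, strictly positive denominators.
rng = random.Random(0)
trials = [(rng.uniform(-5, 5), rng.uniform(-5, 5),
           rng.uniform(0.1, 5), rng.uniform(0.1, 5)) for _ in range(1000)]
print(all(bound_holds(*t) for t in trials))
```

The check succeeds because $f$ is monotone on $[0,1]$, so its absolute value is largest at one of the two endpoints.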

Corollary 2.5. The class $\mathcal{P}$ is convex.



In the following, we do not systematically distinguish a scoring rule, $S$, and
the corresponding scoring function, $s$, both of which will simply be referred
to as scores.

3. Characterization of the local proper scores of order 2. Along with
the logarithmic and the Hyvärinen score, any convex combination thereof is
a local proper score of order 2. However, the class of the local proper scoring
rules of order 2 on the real line, $\mathbb{R}$, has a much richer structure,
and allows for a characterization in terms of concave functionals.

3.1. Main results. We first introduce classes of functions that satisfy
suitable polynomial growth conditions.

Definition 3.1. Let $k$ be a nonnegative integer. The class $\mathcal{R}_k$
consists of all functions $K : \mathbb{R}^{2+k} \to \mathbb{R}$ that admit
continuous partial derivatives up to order $2k$, and for which there exist
finite positive constants $C$ and $r$ such that, whenever $W$ stands for $K$ or
any of its partial derivatives up to order $2k$, then

\[
|W(x, y_0, \ldots, y_k)| \le C \{ (1+|x|)(1+|y_0|) \cdots (1+|y_k|) \}^r
\quad \text{for all } (x, y_0, \ldots, y_k) \in \mathbb{R}^{2+k}.
\]

Note that the growth conditions on the functions in the class $\mathcal{R}_k$,
as well as the decay conditions on the densities in the class $\mathcal{P}$ of
Definition 2.3, apply to each member individually. They are not required to
hold uniformly.

For a function $K \in \mathcal{R}_k$ and a density $p \in \mathcal{P}$ let

\[
\Phi_K(p) = \int_{\mathbb{R}} K(x, \ln p(x), \ldots, (\ln p)^{(k)}(x))\, p(x)\, dx. \tag{3}
\]

The integral exists and is finite by virtue of the growth and decay conditions
imposed on $K$ and $p$, respectively. Thus, any $K \in \mathcal{R}_k$ induces
a well-defined functional $\Phi_K : \mathcal{P} \to \mathbb{R}$. The role of
the function $K$ in (3) resembles that of a kernel in functional analysis.
Hence, we will subsequently refer to $K$ as a kernel, for ease of reference.
The properties of such kernels and the associated functionals play a key role
in our subsequent characterization. In stating it, we use standard
abbreviations to denote the partial derivatives of a function of the form
$g = g(x, y_0, \ldots, y_k)$; for example, we write
$\partial_j g = \partial g / \partial y_j$ and
$\partial^2_{xj} g = \partial^2 g / (\partial x\, \partial y_j)$. The proof is
given in Section 4.

The subsequent two results are closely connected to the work of Parry,
Dawid and Lauritzen (2012); see Remark 3.4.

Theorem 3.2. Let $\mathcal{P}$ denote the class of probability densities
introduced in Definition 2.3.

(a) Consider a kernel $K$ of the form

\[
K(x, y_0, y_1) = c\, y_0 + K_0(x, y_1), \tag{4}
\]

where $c$ is a real constant and $K_0$ is a real function on $\mathbb{R}^2$.
If $K \in \mathcal{R}_1$ and the functional $\Phi_K$ is concave, the function
$s : \mathbb{R}^4 \to \mathbb{R}$, defined by

\[
s(x, y_0, y_1, y_2) = c\, y_0
+ (1 - y_1 \partial_1 - \partial^2_{x1} - y_2 \partial_1^2) K_0(x, y_1), \tag{5}
\]

represents a local score of order 2 that is proper relative to $\mathcal{P}$.

(b) Conversely, if $s \in \mathcal{R}_2$ represents a local score of order 2
that is proper relative to $\mathcal{P}$, there exists a kernel
$K \in \mathcal{R}_1$ of the form (4), where $c$ is a real constant and $K_0$
is a real function on $\mathbb{R}^2$, such that the functional $\Phi_K$ is
concave and $s$ admits the representation (5).

(c) The above statements remain valid with concave replaced by strictly
concave, and proper replaced by strictly proper.

The following sufficient condition for the functional $\Phi_K$ to be concave
will be proved in Section 5.1.

Proposition 3.3. Suppose that $K$ is a kernel of the form (4) such that
(i) $K \in \mathcal{R}_1$, (ii) $c \le 0$, and (iii) the map
$y_1 \mapsto K_0(x, y_1)$ is concave for every $x \in \mathbb{R}$. Then the
functional $\Phi_K : \mathcal{P} \to \mathbb{R}$ is concave. The statement
continues to hold if concave is replaced by strictly concave.

The criterion provides a straightforward method of constructing local
proper scores of order 2 via the basic relationship (5). For example, the
kernel $K(x, y_0, y_1) = -y_0$ yields the scoring function
$s(x, y_0, y_1, y_2) = -y_0$, which represents the logarithmic score (1). The
associated functional

\[
\Phi(p) = S(p,p) = -\int p(x) \ln p(x)\, dx
\]

is the Shannon entropy, and the associated divergence

\[
d_{\mathrm{KL}}(p,q) = S(p,q) - S(p,p) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx
\]

is the Kullback-Leibler divergence. Similarly, the kernel
$K(x, y_0, y_1) = -y_1^2$ yields the scoring function
$s(x, y_0, y_1, y_2) = y_1^2 + 2 y_2$, which represents the Hyvärinen score (2).
The associated functional and divergence

\[
\Phi(p) = -\int \frac{p'(x)^2}{p(x)}\, dx
\quad \text{and} \quad
d_{\mathrm{FI}}(p,q) = \int \left( \frac{p'(x)}{p(x)} - \frac{q'(x)}{q(x)} \right)^2 p(x)\, dx
\]

are minus the Fisher information and the Fisher information distance
[DasGupta (2008), Definitions 2.5 and 2.6, pages 25 and 26], respectively. For
further examples, see Section 5.3.
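The passage from a kernel to a score via (5) can be sketched in code. The following Python illustration is not the authors' implementation: it approximates the partial derivatives of $K_0$ in (5) by central finite differences, and the function name `score_from_kernel`, the step size and the third, $x$-dependent kernel are hypothetical choices made here.

```python
def score_from_kernel(K0, c, x, y0, y1, y2, h=1e-3):
    """Evaluate s = c*y0 + (1 - y1*d_1 - d^2_{x1} - y2*d_1^2) K0(x, y1),
    with the partial derivatives of K0 replaced by central differences."""
    d1 = (K0(x, y1 + h) - K0(x, y1 - h)) / (2.0 * h)
    d11 = (K0(x, y1 + h) - 2.0 * K0(x, y1) + K0(x, y1 - h)) / h ** 2
    dx1 = (K0(x + h, y1 + h) - K0(x + h, y1 - h)
           - K0(x - h, y1 + h) + K0(x - h, y1 - h)) / (4.0 * h * h)
    return c * y0 + K0(x, y1) - y1 * d1 - dx1 - y2 * d11

# Arbitrary evaluation point (x, y0, y1, y2).
point = (0.3, -1.0, 0.7, -0.5)

# K = -y0 (c = -1, K0 = 0) should return the logarithmic score -y0.
log_s = score_from_kernel(lambda x, y1: 0.0, -1.0, *point)

# K = -y1^2 (c = 0) should return the Hyvarinen score y1^2 + 2*y2.
hyv_s = score_from_kernel(lambda x, y1: -y1 ** 2, 0.0, *point)

# Hypothetical x-dependent kernel, concave in y1 for every x.
weighted_s = score_from_kernel(lambda x, y1: -(1.0 + x * x) * y1 ** 2, 0.0, *point)

print(log_s, hyv_s, weighted_s)
```

For the third kernel, working out (5) by hand gives $s = (1+x^2)(y_1^2 + 2y_2) + 4xy_1$, which the finite-difference evaluation reproduces up to rounding.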

3.2. Remarks. It has to be emphasized that the present work owes a great
deal to interactions with Philip Dawid, Steffen Lauritzen and Matthew
Parry, which began with their kindly pointing out an error in our previous
work [Ehm and Gneiting (2009)].

Remark 3.4 (Acknowledgment of priority). In the compact notation
explained in Section 4, any second-order local proper scoring rule can be
written as

\[
s = K - \left( z_1 + \frac{d}{dx} \right) \partial_1 K. \tag{6}
\]

We learned about this representation in a personal communication [Dawid,
Parry and Lauritzen (2009)]. Detail on the relation of our work to the paper
by Parry, Dawid and Lauritzen (2012) is provided in the next remark.

Remark 3.5. Employing an elegant approach based on operator algebra,
Parry, Dawid and Lauritzen (2012) investigate local proper scoring rules
of any order $k \ge 0$ on a general open interval of the real line. In a tour
de force, they establish the existence of key local score functions for any
even order, and their nonexistence for odd orders, in addition to studying
their invariance under data transformations. In the case $k = 2$ the general
form (39) of the key local scoring rules in Parry, Dawid and Lauritzen (2012)
is essentially equivalent to ours, up to the parameterization in terms of
densities rather than log densities.

Despite the many parallels to the work of Parry, Dawid and Lauritzen
(2012), there are important differences, including the basic approach and
techniques employed. A key local score derives from the homogeneous
Euler-Lagrange equation, which characterizes the scores for which every
density $p$ is a stationary point of the mapping $q \mapsto S(p,q)$.
Accordingly, the analysis of Parry, Dawid and Lauritzen (2012) is in terms of
differential calculus, which leads to separate discussions of the boundary
terms from partial integrations and of sufficient conditions for (strict)
propriety. The latter occur in Theorem 9.1 of Parry, Dawid and Lauritzen (2012)
in the form of concavity conditions on homogeneous $q$-functions, which
correspond to our kernels; Proposition 3.3 states essentially the same result
in the case $k = 2$.

In a different ansatz, our work starts from the characterization of proper
scoring rules via concave functionals and their (super-)gradients [Hendrickson
and Buehler (1971), Gneiting and Raftery (2007)]. This readily yields
the basic form (18) of the second-order local proper scoring rules in a natural
tangent construction, up to a possibly nonlocal term. Only then do we apply
the calculus of variations to show that the possibly nonlocal term vanishes,
which establishes the definite form (5). Control of the boundary terms from
partial integrations is vital, and is achieved through our particular choice of
the classes of scoring functions and predictive densities. The explicit
specification of these two classes, along with the tangent construction, allows
us to give a rigorous, yet full-fledged and practically relevant
characterization of the second-order local proper scoring rules, and hence
constitutes the main original contribution of our work.

We continue with comments relating to the choice of the class $\mathcal{P}$
and the complementary roles of the kernel $K$ as a function and a functional,
thereby touching on the generality of Theorem 3.2 and Proposition 3.3.

Remark 3.6. There is a slight asymmetry in Theorem 3.2, in that under
the conditions of the sufficiency part (a) the scoring function is continuous
only, whereas the necessity part (b) requires it to be four times continuously
differentiable. Other than this, the theorem accomplishes a full
characterization of the local proper scoring rules of order 2 relative to the
class $\mathcal{P}$ of Definition 2.3.

Remark 3.7. Part (a) of Theorem 3.2 expresses a local proper score
of order 2, $s$, in terms of a kernel, $K$, with suitable properties.
Similarly, part (b) admits a constructive extension that finds and expresses
a suitable kernel, $K$, in terms of a local proper score of order 2, $s$. See
Section 4.3 for the explicit construction and Example 5.2 for an illustration.

Remark 3.8. Theorem 3.2 has been stated for the special class $\mathcal{P}$
of Definition 2.3. Propriety relative to such a broad class is a fairly
demanding requirement, and from this perspective, part (a) is a strong result.
In contrast, part (b) would be stronger if propriety was required relative to
a subclass $\mathcal{P}_0 \subset \mathcal{P}$ only. On the other hand,
$\mathcal{P}_0$ must not be too narrow. An inspection of Section 4 shows that
part (b) remains valid relative to any convex subclass
$\mathcal{P}_0 \subset \mathcal{P}$ with the following two additional
properties:

(P5) if a continuous function $f$ on $\mathbb{R}$ with at most polynomial
growth at $\pm\infty$ satisfies $\int f(x)(p(x) - q(x))\, dx \ge 0$ for all
$p, q \in \mathcal{P}_0$, then $f$ is constant;

(P6) the richness properties of Lemma 4.9 hold for $\mathcal{P}_0$.

The inequality in condition (P5) can be replaced by equality, making (P5)
a variant of the classical property of completeness of the family
$\mathcal{P}_0$. Property (P5) is needed in Section 4.2, while property (P6) is
required in Section 4.4. The full class $\mathcal{P}$ does satisfy these
conditions.

Remark 3.9. The sufficiency part of Theorem 3.2 would be stronger if
the statement applied relative to larger classes
$\mathcal{P}_1 \supset \mathcal{P}$. The following adaptation of an example of
Huber (1974) shows that any such extension may entail unexpected effects for
strict propriety, with undesirable consequences in applications. Suppose that
$\mathcal{P}$ is augmented to a convex class $\mathcal{P}_1$ that includes the
densities

\[
p_\alpha(x) =
\begin{cases}
\alpha\, g(x) & \text{if } x \ge 0, \\
(1-\alpha)\, g(-x) & \text{if } x < 0,
\end{cases}
\]

where $\alpha \in (0,1)$ and $g(x) = x^5 e^{-x} / \Gamma(6)$ for $x \ge 0$. The
densities $p_\alpha$ satisfy all conditions for the class $\mathcal{P}$ except
for property (P1), since $p_\alpha(x) = 0$ at $x = 0$. As the logarithmic
derivatives, $p_\alpha'(x)/p_\alpha(x)$, do not depend on $\alpha$, the Fisher
information of $p_\alpha$ does not depend on $\alpha$ either, hence its
negative is not strictly concave as a functional on $\mathcal{P}_1$.
Accordingly, the Fisher information distance does not distinguish the
densities $p_\alpha$, that is, $d_{\mathrm{FI}}(p_\alpha, p_\beta) = 0$ for
$\alpha, \beta \in (0,1)$, and the Hyvärinen score (2) fails to be strictly
proper relative to the augmented class $\mathcal{P}_1$. In particular, strict
concavity of the function $K_0(x, y_1)$ of Proposition 3.3 in $y_1$ does not
imply strict concavity of the associated functional, unless we restrict the
class of densities under consideration.
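The effect described in Remark 3.9 can be reproduced numerically. The sketch below assumes that the densities take the form $p_a(x) = a\,g(x)$ for $x \ge 0$ and $p_a(x) = (1-a)\,g(-x)$ for $x < 0$, with $g$ the density $x^5 e^{-x}/\Gamma(6)$; this form, the integration cutoff and all function names are assumptions made for illustration. It shows that the Fisher information is the same for different weights $a$, whereas the Kullback-Leibler divergence still separates the densities.

```python
import math

def g(x):
    """Density x^5 exp(-x) / Gamma(6) on x > 0."""
    return x ** 5 * math.exp(-x) / math.gamma(6)

def p(x, a):
    """Two-sided density: a*g(x) for x >= 0 and (1-a)*g(-x) for x < 0."""
    return a * g(x) if x >= 0 else (1.0 - a) * g(-x)

def log_derivative(x):
    """(ln p_a)'(x) for x != 0; the weight a cancels out."""
    return 5.0 / x - (1.0 if x > 0 else -1.0)

def midpoint(f, lo, hi, n=8000):
    """Composite midpoint rule; it never evaluates f at the endpoints."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def integrate_line(f, cutoff=80.0):
    """Integrate over both half-lines, avoiding the zero of p_a at x = 0."""
    return midpoint(f, -cutoff, 0.0) + midpoint(f, 0.0, cutoff)

def fisher_information(a):
    return integrate_line(lambda x: log_derivative(x) ** 2 * p(x, a))

def kl_divergence(a, b):
    return integrate_line(lambda x: p(x, a) * math.log(p(x, a) / p(x, b)))

print(fisher_information(0.2), fisher_information(0.7))  # equal up to rounding
print(kl_divergence(0.2, 0.7))                           # strictly positive
```

Since the log derivatives of $p_a$ and $p_b$ coincide, the integrand of the Fisher information distance vanishes identically, so the Hyvärinen score cannot tell the two forecasts apart.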

Remark 3.10. By Proposition 3.3, concavity of a kernel $K$ of the form (4)
in $y_1$ implies concavity of the associated functional $\Phi_K$ on the class
$\mathcal{P}$. Conversely, what are the consequences of concavity of the
functional $\Phi_K$ on the kernel $K$? The example of the logarithmic score (1)
demonstrates that matters are not straightforward; here the functional
$\Phi_K$ is strictly concave, yet the kernel $K(x, y_0, y_1) = -y_0$ is not.

Now consider any kernel $K$ of the form (4) for which the associated
functional $\Phi_K$ is concave on $\mathcal{P}$. Do we necessarily have
$c \le 0$ then? This is indeed true if $K_0(x, y_1) = -y_1^2$ represents the
Hyvärinen score (2). Then by propriety

\[
0 \le S(p,q) - S(p,p) = d_{\mathrm{FI}}(p,q) - c\, d_{\mathrm{KL}}(p,q)
\]

for all $p, q \in \mathcal{P}$, so that $c \le 0$ is necessary if the ratio
$r = d_{\mathrm{FI}}(p,q) / d_{\mathrm{KL}}(p,q)$ can attain arbitrarily small
values. However, $\mathcal{P}$ contains all normal densities, and if $p$ and
$q$ are normal with mean zero and standard deviations $\sigma$ and $\tau$,
then

\[
d_{\mathrm{KL}}(p,q) = \frac{1}{2} \left( \frac{\sigma^2}{\tau^2} - 1 - \ln \frac{\sigma^2}{\tau^2} \right)
\quad \text{and} \quad
d_{\mathrm{FI}}(p,q) = \frac{1}{\sigma^2} \left( \frac{\sigma^2}{\tau^2} - 1 \right)^2,
\]

whence $r$ can attain any positive value. The argument clearly depends on
the class $\mathcal{P}$; it fails if $\mathcal{P}$ is replaced by a narrower
class $\mathcal{P}_0$ for which the ratio $r$ is bounded away from zero. Such
is in fact possible due to a logarithmic Sobolev inequality, which asserts that
for certain classes $\mathcal{P}_0 \subset \mathcal{P}$ one has
$d_{\mathrm{KL}}(p,q) \le C\, d_{\mathrm{FI}}(p,q)$ for
$p, q \in \mathcal{P}_0$ with a constant $C$ that depends only on
$\mathcal{P}_0$. A corresponding reference is Villani (2009): put
$u = \sqrt{p/q}$ and $d\nu(x) = q(x)\,dx$ in equation (21.3) and consider
Remark 21.4.
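The two divergences for centered normal densities can be verified by quadrature. The Python sketch below is an illustration under stated assumptions: the closed forms are those obtained by evaluating the defining integrals of $d_{\mathrm{KL}}$ and $d_{\mathrm{FI}}$ for mean-zero normals with standard deviations `sigma` and `tau`, and all function names are hypothetical.

```python
import math

def integrate(f, a, b, n=4000):
    """Composite trapezoid rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def kl_numeric(sigma, tau):
    """Integral of p ln(p/q); the log ratio is written out analytically."""
    def integrand(x):
        log_ratio = math.log(tau / sigma) - 0.5 * x * x * (sigma ** -2 - tau ** -2)
        return normal_pdf(x, sigma) * log_ratio
    return integrate(integrand, -12.0 * sigma, 12.0 * sigma)

def fi_numeric(sigma, tau):
    """Integral of ((ln p)' - (ln q)')^2 p."""
    def integrand(x):
        diff = -x / sigma ** 2 + x / tau ** 2
        return diff * diff * normal_pdf(x, sigma)
    return integrate(integrand, -12.0 * sigma, 12.0 * sigma)

def kl_closed(sigma, tau):
    u = sigma ** 2 / tau ** 2
    return 0.5 * (u - 1.0 - math.log(u))

def fi_closed(sigma, tau):
    u = sigma ** 2 / tau ** 2
    return (u - 1.0) ** 2 / sigma ** 2

def ratio(sigma, tau):
    return fi_closed(sigma, tau) / kl_closed(sigma, tau)

# Keeping sigma/tau fixed and rescaling sigma moves r over all of (0, inf).
for sigma in (0.1, 1.0, 10.0):
    print(sigma, ratio(sigma, sigma / 2.0))
```

With the quotient $\sigma/\tau$ held fixed, $d_{\mathrm{KL}}$ stays constant while $d_{\mathrm{FI}}$ scales like $1/\sigma^2$, which is how the ratio $r$ can be made arbitrarily small or large.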

4. Proof of Theorem 3.2. Our point of departure is Theorem 1 of Gneiting
and Raftery (2007), which can be traced to Hendrickson and Buehler
(1971) and characterizes proper scoring rules by means of the supergradients
of concave functionals on convex classes of probability measures. We state
it in the special case where that class corresponds to the set $\mathcal{P}$ of
Lebesgue densities introduced in Definition 2.3. Throughout this section
propriety is understood as propriety relative to $\mathcal{P}$.

Theorem 4.1. Let $\Phi$ be a real-valued concave functional on $\mathcal{P}$
with supergradient $\Phi^*(\cdot, p) : \mathbb{R} \to \mathbb{R}$ at
$p \in \mathcal{P}$, that is,

\[
\Phi(q) - \Phi(p) - \int \Phi^*(x,p)(q(x) - p(x))\, dx \le 0
\quad \text{for } p, q \in \mathcal{P}.
\]

Then the scoring rule

\[
S(\cdot, p) = \Phi^*(\cdot, p) - \int \Phi^*(x,p)\, p(x)\, dx + \Phi(p) \tag{7}
\]

is proper, and

\[
S(p,p) = \int S(x,p)\, p(x)\, dx = \Phi(p) \quad \text{for } p \in \mathcal{P}.
\]

Conversely, if $S$ is proper, then $\Phi(p) = S(p,p)$ is a concave functional
on $\mathcal{P}$ with supergradient $\Phi^*(\cdot, p) = S(\cdot, p)$ at
$p \in \mathcal{P}$, whence $S$ is of the form (7). Furthermore, the above
continues to hold with concave replaced by strictly concave, and proper
replaced by strictly proper.

For sufficiently regular local proper scoring rules we can compute gradients
of the corresponding functionals. Specifically, a function
$G(\cdot, p) : \mathbb{R} \to \mathbb{R}$ is a weak gradient, or simply
a gradient, of the functional $\Phi$ at $p \in \mathcal{P}$ if for every
$q \in \mathcal{P}$

\[
\frac{d}{dt}\, \Phi(q_t) \Big|_{t=0} = \int G(x,p)(q(x) - p(x))\, dx, \tag{8}
\]

where

\[
q_t = (1-t)\, p + t\, q \quad \text{for } t \in [0,1]. \tag{9}
\]

Any (super-)gradient is defined only modulo an arbitrary additive constant
that may depend on $p$, which does not affect the construction (7).

Theorem 4.1 along with such tangent calculations gives us a construction
method for local proper scores that readily elucidates their particular form.
We refer to this approach as the tangent construction and give details in the
following section, before completing the proof of Theorem 3.2 in a series of
subsequent steps.

4.1. Tangent construction of proper scores. In what follows we use
compactified notation whenever possible. As noted, we do not systematically
distinguish scoring rules, $S$, and the corresponding scoring functions, $s$,
both of which are referred to as scores. Log-likelihoods and their derivatives
are denoted by $z_0(x,p) = \ln p(x)$ and

\[
z_j(x,p) = (\ln p)^{(j)}(x) \quad \text{for } j = 1, 2, \ldots,
\]

or simply $z_0$ and $z_j$ if the density $p$ is fixed. Clearly then,

\[
z_j' = \frac{d}{dx}\, z_j = z_{j+1} \quad \text{for } j = 0, 1, 2, \ldots,
\]

where the prime denotes differentiation with respect to $x$. We usually
suppress the differential, $dx$, in integrals over $x \in \mathbb{R}$, and in
the corresponding integrands we omit all or part of the arguments whenever
these are clear from the context. For example, given $K \in \mathcal{R}_k$ and
$p, q \in \mathcal{P}$ we may abbreviate

\[
\int_{\mathbb{R}} K(x, \ln q(x), \ldots, (\ln q)^{(k)}(x))\, p(x)\, dx
\]

as

\[
\int K p \quad (K = K_q),
\]

where, evidently, $K_q = K_q(x) = K(x, \ln q(x), \ldots, (\ln q)^{(k)}(x))$.

We now develop the tangent construction. The first step consists in
calculating the gradients of (not necessarily concave) functionals of kernel
type.

Lemma 4.2. Let $K \in \mathcal{R}_2$. Then a gradient $G$ of the associated
functional $\Phi_K : \mathcal{P} \to \mathbb{R}$ exists at any
$p \in \mathcal{P}$ and is given uniquely by

\[
G = K + \partial_0 K - \frac{1}{p} \frac{d}{dx} [p\, \partial_1 K]
+ \frac{1}{p} \frac{d^2}{dx^2} [p\, \partial_2 K]
\quad (G = G_p,\ K = K_p) \tag{10}
\]

up to an arbitrary additive constant that may depend on $p$.

Recall that according to our notational conventions, (10) means that the
relation holds whenever the functions $G$, $K$ and $\partial_j K$ are evaluated
at arguments

\[
(x, z_0, z_1, z_2) = (x, \ln p(x), (\ln p)'(x), (\ln p)''(x)),
\]

where $p \in \mathcal{P}$ and $x \in \mathbb{R}$.
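The gradient formula (10) can be checked against a finite-difference derivative along the mixture path $q_t = (1-t)p + tq$. The sketch below is a numerical illustration only. It assumes the Hyvärinen kernel $K = -y_1^2$, for which (10) reduces to $G = y_1^2 + 2y_2$ and the functional is minus the Fisher information, and it takes $p$ and $q$ to be centered normal densities. The standard deviations, the step size and all function names are choices made here, not by the paper.

```python
import math

def integrate(f, a, b, n=4000):
    """Composite trapezoid rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def normal_pdf_dx(x, sd):
    """Derivative of the centered normal density with respect to x."""
    return -x / sd ** 2 * normal_pdf(x, sd)

SD_P, SD_Q = 1.0, 0.8  # illustrative choice; q has lighter tails than p

def phi(t):
    """Phi_K(q_t) = -int (q_t')^2 / q_t dx for the kernel K = -y1^2."""
    def integrand(x):
        qt = (1.0 - t) * normal_pdf(x, SD_P) + t * normal_pdf(x, SD_Q)
        dqt = (1.0 - t) * normal_pdf_dx(x, SD_P) + t * normal_pdf_dx(x, SD_Q)
        return -dqt * dqt / qt
    return integrate(integrand, -12.0, 12.0)

def gradient_at_p(x):
    """G = y1^2 + 2*y2 from (10), evaluated along p = N(0, SD_P^2)."""
    y1 = -x / SD_P ** 2
    y2 = -1.0 / SD_P ** 2
    return y1 * y1 + 2.0 * y2

# Left side of (8): one-sided second-order difference quotient at t = 0.
h = 1e-3
lhs = (-3.0 * phi(0.0) + 4.0 * phi(h) - phi(2.0 * h)) / (2.0 * h)

# Right side of (8): integral of G (q - p).
rhs = integrate(
    lambda x: gradient_at_p(x) * (normal_pdf(x, SD_Q) - normal_pdf(x, SD_P)),
    -12.0, 12.0)

print(lhs, rhs)
```

For this choice both sides equal $\tau^2 - \sigma^2$ with $\sigma = 1$ and $\tau = 0.8$, up to discretization error.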

Proof of Lemma 4.2. Let $p \in \mathcal{P}$ be fixed. In calculating
a gradient of $\Phi = \Phi_K$ at $p$ we initially ignore all technicalities,
that is, we assume that integrals are well defined and finite, that the order
of integration and differentiation can be interchanged, and that boundary
terms in partial integrations vanish. Then

\[
\frac{d}{dt}\, \Phi(q_t) = \int \frac{d}{dt} [K_t q_t]
= \int K_t (q - p) + \int \left( \frac{d}{dt} K_t \right) q_t, \tag{11}
\]

where $q_t$ denotes the mixture density (9) and

\[
K_t = K_{q_t} = K(x, \ln q_t(x), (\ln q_t)'(x), (\ln q_t)''(x)).
\]

Since $\frac{\partial}{\partial t} \ln q_t = (q - p)/q_t$, the mixed derivative
with respect to $t$ and $x$ of order $j$ is given by

\[
\frac{\partial}{\partial t} (\ln q_t)^{(j)} = \left( \frac{q - p}{q_t} \right)^{(j)}.
\]

The second term on the right-hand side of (11) can then be computed using
partial integration, in that

\[
\begin{aligned}
\int \left( \frac{d}{dt} K_t \right) q_t
&= \int \left\{ (\partial_0 K_t)\, \frac{q-p}{q_t}
+ (\partial_1 K_t) \left( \frac{q-p}{q_t} \right)'
+ (\partial_2 K_t) \left( \frac{q-p}{q_t} \right)'' \right\} q_t \\
&= \int \left\{ \partial_0 K_t
- \frac{1}{q_t} \frac{d}{dx} [q_t\, \partial_1 K_t]
+ \frac{1}{q_t} \frac{d^2}{dx^2} [q_t\, \partial_2 K_t] \right\} (q - p).
\end{aligned} \tag{12}
\]

Evaluating at $t = 0$, and noting that $q_0 = p$ and $K_0 = K_p = K$, (11) and
(12) yield

\[
\frac{d}{dt}\, \Phi(q_t) \Big|_{t=0}
= \int \left\{ K + \partial_0 K - \frac{1}{p} \frac{d}{dx} [p\, \partial_1 K]
+ \frac{1}{p} \frac{d^2}{dx^2} [p\, \partial_2 K] \right\} (q - p),
\]

showing that $G$ from (10) is indeed a gradient of $\Phi$ at $p$.

It remains to settle the technicalities. Generally, if a family
$\{ h(x,t) : x \in \mathbb{R},\ t \in [0,1] \}$ is such that $h(x,t)$ is
integrable with respect to $x$ for every $t$, and the family
$\{ \partial_t h(x,t) : x \in \mathbb{R},\ t \in [0,1] \}$ of partial
derivatives is uniformly integrable and continuous in $t$ for every $x$, then
$H(t) = \int h(x,t)\, dx$ is differentiable with

\[
\frac{dH}{dt}(0) = \int \partial_t h(x,0)\, dx.
\]

Here we consider (11) and identify $h(\cdot, t) = K_t q_t$, so that

\[
\partial_t h(\cdot, t) = (K_t + \partial_0 K_t)(q - p)
+ \left\{ (\partial_1 K_t) \left( \frac{q-p}{q_t} \right)'
+ (\partial_2 K_t) \left( \frac{q-p}{q_t} \right)'' \right\} q_t. \tag{13}
\]

Now $\partial_t h(\cdot, t)$ is continuous in $t$, because $K$ and its partial
derivatives are continuous, and their arguments depend continuously on $t$.
Concerning uniform integrability, each of the terms $K_t$, $\partial_0 K_t$,
$\partial_1 K_t$, $\partial_2 K_t$ grows at most polynomially as
$x \to \pm\infty$. This is because by Lemma 2.4 and property (P4) of the class
$\mathcal{P}$ the arguments of the terms grow at most polynomially; as
$K \in \mathcal{R}_2$, the same is true for the functions themselves.
Furthermore, by property (P3) and the above, the terms $q - p$,

\[
\left( \frac{q-p}{q_t} \right)' q_t
= \left( \frac{q' - p'}{q_t} - \frac{(q-p)\, q_t'}{q_t^2} \right) q_t
= q' - p' - (q-p)(\ln q_t)'
\]

and

\[
\left( \frac{q-p}{q_t} \right)'' q_t
= q'' - p'' - 2(q' - p')(\ln q_t)' - (q-p)(\ln q_t)''
+ (q-p)((\ln q_t)')^2,
\]

decay faster than the reciprocal of any polynomial as $x \to \pm\infty$.
Therefore, the corresponding products in (13) with the terms involving $K$
decay faster than the reciprocal of any polynomial as well. By Lemma 2.4, this
property holds uniformly in $t \in [0,1]$. Thus, the family (13) is uniformly
integrable, and we may interchange the order of the integration and
differentiation. Similar growth and decay considerations show that the boundary
terms in the partial integrations in (12) vanish.

Finally, uniqueness follows from the property (P5) satisfied by the class
$\mathcal{P}$ (cf. Remark 3.8) and the at most polynomial growth of $G$ as
$|x| \to \infty$. □

For use later on, we also state a version of Lemma 4.2, in which
$K \in \mathcal{R}_1$ so that $\partial_2 K$ vanishes. The proof is analogous.

Lemma 4.3. Suppose that the kernel $K$ depends on arguments $x$, $z_0$
and $z_1$ only and belongs to $\mathcal{R}_1$. Then a gradient $G$ of the
associated functional $\Phi_K : \mathcal{P} \to \mathbb{R}$ exists at any
$p \in \mathcal{P}$, and is given uniquely by

\[
G = K + \partial_0 K - \frac{1}{p} \frac{d}{dx} [p\, \partial_1 K]
\quad (G = G_p,\ K = K_p) \tag{14}
\]

up to an arbitrary additive constant that may depend on $p$.

Hereafter we will ignore the irrelevant additive constant and refer to the
expression in (10) and (14), respectively, as the tangent of $\Phi$ at $p$.
A common form of the tangent valid for both $k = 1$ and $k = 2$ is

\[
G = K + \partial_0 K + L_0 K, \tag{15}
\]

where the differential operator $L_0$ is formally defined via the infinite sum

\[
L_0 K = \sum_{j \ge 1} (-1)^j\, \frac{1}{p} \frac{d^j}{dx^j} [p\, \partial_j K], \tag{16}
\]

and $L_0$ and $K$ depend tacitly on $p$. If $K \in \mathcal{R}_k$, all but the
first $k$ terms in the sum vanish, and so the definition makes good sense. In
terms of the operator $L$ in equation (19) of Parry, Dawid and Lauritzen (2012)
we have $L = -p(\partial_0 + L_0)$.

For the second step of the tangent construction let again $\Phi = \Phi_K$ be
a kernel type functional associated with some kernel $K$ in $\mathcal{R}_1$ or
$\mathcal{R}_2$. If $\Phi$ is concave, then the tangent $G$ of $\Phi$ at
$p \in \mathcal{P}$ is easily seen to be also a supergradient, and by
Theorem 4.1 a proper score is obtained by setting

\[
\begin{aligned}
s &= G - \int G p + \Phi(p) \quad (s = s_p,\ G = G_p) \\
  &= K + \partial_0 K + L_0 K - \int (\partial_0 K)\, p
  \quad (K = K_p,\ L_0 = L_{0,p}).
\end{aligned} \tag{17}
\]

As for the step leading to (17), note that in view of (3) and (15) we have

\[
\Phi(p) - \int G p
= \int K p - \left( \int K p + \int (\partial_0 K)\, p + \int (L_0 K)\, p \right)
= -\int (\partial_0 K)\, p
\]

on using the fact that $\int (L_0 K)\, p = 0$. This latter equality holds
because the integrand is a total derivative, the primitive of which vanishes as
$x \to \pm\infty$, due to the growth and decay properties of the functions in
$\mathcal{R}_k$ and densities in $\mathcal{P}$. We will refer to this trivial
observation as the vanishing argument. Since it will be used several times we
state it as a lemma, despite its simplicity.

Lemma 4.4 (Vanishing argument). Let $p \in \mathcal{P}$, and let $W$ be
a real, differentiable function such that the function
$g = p^{-1} \frac{d}{dx} [pW]$ is $p$-integrable, $\int |g|\, p < \infty$, and
$\lim_{x \to \pm\infty} p(x) W(x) = 0$. Then $\int g\, p = 0$.
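The vanishing argument is easy to confirm numerically in a simple case. The sketch below assumes a centered normal density $p$ and takes $W = \partial_1 K = -2y_1$, as arises for the Hyvärinen kernel $K = -y_1^2$; these choices and the function names are illustrative assumptions.

```python
import math

def integrate(f, a, b, n=4000):
    """Composite trapezoid rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def vanishing_integral(sd):
    """Integral of g*p with g = (1/p) d/dx [p W] = y1*W + W' and W = -2*y1."""
    def integrand(x):
        y1 = -x / sd ** 2        # (ln p)'(x)
        y2 = -1.0 / sd ** 2      # (ln p)''(x)
        g = -2.0 * y1 * y1 - 2.0 * y2
        return g * normal_pdf(x, sd)
    return integrate(integrand, -12.0 * sd, 12.0 * sd)

print(vanishing_integral(1.0), vanishing_integral(2.5))
```

Both values are zero up to rounding, because $g\,p$ is the derivative of $pW$, and $pW$ tends to zero in both tails.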

This concludes the tangent construction of a proper score $s$ from a concave
functional $\Phi = \Phi_K$ where $K \in \mathcal{R}_k$ ($k = 1, 2$). We summarize the foregoing
discussion.

Proposition 4.5 (Tangent construction). Suppose that $K \in \mathcal{R}_k$ where
$k = 1$ or $k = 2$, and that the associated functional $\Phi_K$ is concave. Then

$$s = K + L_0 K + \partial_0 K - \int (\partial_0 K)\,p \qquad (s = s_p,\ K = K_p,\ L_0 = L_{0,p}) \tag{18}$$

is a proper score relative to $\mathcal{P}$. It is local of order $2k$ if $\partial_0 K$ is constant in $x$
for every $p \in \mathcal{P}$, or if $\int (\partial_0 K)\,p$ does not depend on $p$.

Proof. The first claim has already been proved. Locality under the
stated conditions is obvious: if $\partial_0 K = \partial_0 K(x, \ln p(x), (\ln p)'(x))$ (for $k = 1$,
say) does not depend on $x$, it equals its expectation, $\int (\partial_0 K)\,p$. Finally, an
explicit evaluation of the total differential(s) in the term $L_0 K$ yields partial
derivatives of order $\le 2k$ only, thereby proving the order $2k$ claim. □
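
To make the tangent construction concrete, consider the simplest case of a kernel $K = K(z_1)$ that depends on $z_1$ only. Then $\partial_0 K = 0$ and $L_0 K = -\frac{1}{p}\frac{d}{dx}[p K'(z_1)] = -z_1 K'(z_1) - z_2 K''(z_1)$, so that (18) reduces to $s = K - z_1 K' - z_2 K''$. The Python sketch below evaluates this special case; the function name `tangent_score` and its signature are assumptions made for illustration, and the derivatives of the kernel have to be supplied by the user.

```python
import math

def tangent_score(K, dK, d2K, z1, z2):
    """Score s = K - z1*K' - z2*K'' obtained from (18) for a kernel K = K(z1).

    K, dK, d2K : callables giving the kernel and its first two derivatives
    z1, z2     : first and second derivative of ln p at the observation
    """
    return K(z1) - z1 * dK(z1) - z2 * d2K(z1)

# Quadratic kernel K = -z1^2: the construction should return z1^2 + 2*z2.
z1, z2 = 0.7, -1.3
quadratic = tangent_score(lambda y: -y * y, lambda y: -2.0 * y, lambda y: -2.0, z1, z2)
print(abs(quadratic - (z1**2 + 2.0 * z2)) < 1e-12)

# Kernel K = -ln cosh z1, with K' = -tanh and K'' = -(1 - tanh^2).
log_cosh = tangent_score(
    lambda y: -math.log(math.cosh(y)),
    lambda y: -math.tanh(y),
    lambda y: -(1.0 - math.tanh(y) ** 2),
    z1,
    z2,
)
```

Kernels that also depend on $x$ or $z_0$ require the additional terms in (18) and are not covered by this sketch.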

4.2. Variational calculus. A score $s$ is proper if the functional
$\mathcal{P} \ni q \mapsto S(p, q)$ achieves its minimum at $q = p$, for every $p \in \mathcal{P}$. This circumstance
allows a variational characterization of (in fact, a necessary condition for)
the local proper scores.






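Lemma 4.6. Suppose that $s \in \mathcal{R}_2$ is a local proper score relative to $\mathcal{P}$.
Then for every $p \in \mathcal{P}$ one has

$$\partial_0 s + L_0 s = \partial_0 s - \frac{1}{p}\frac{d}{dx}[p\,\partial_1 s] + \frac{1}{p}\frac{d^2}{dx^2}[p\,\partial_2 s] = c_p \qquad (s = s_p,\ L_0 = L_{0,p}) \tag{19}$$

on $\mathbb{R}$, where

$$c_p = \int (\partial_0 s)\,p \qquad (s = s_p). \tag{20}$$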

Proof. Fix $p \in \mathcal{P}$ and consider convex combinations of the form $q_t =
(1 - t)p + tq$ where $q \in \mathcal{P}$ and $t \in (0, 1)$. As the score is proper, we have
$(S(p, q_t) - S(p, p))/t \ge 0$ for every $t$. Let us compute the limit as $t \to 0$.
Putting $s_t = s_{q_t}$, we have at first

$$t^{-1}(S(p, q_t) - S(q_t, q_t)) = t^{-1}\int s_t\,(p - q_t) = -\int s_t\,(q - p).$$

Arguing in the same way as in the proof of Lemma 4.2, we find that the
integrand is uniformly integrable and continuous in $t$, so the limit exists and
equals $-\int s\,(q - p)$ where $s = s_0 = s_p$. Thus writing

$$S(p, q_t) - S(p, p) = S(p, q_t) - S(q_t, q_t) + S(q_t, q_t) - S(p, p)$$

and using Lemma 4.2 and (16), we get

$$\begin{aligned}
\lim_{t \to 0} t^{-1}(S(p, q_t) - S(p, p))
&= -\int s\,(q - p) + \int \Bigl(s + \partial_0 s - \frac{1}{p}\frac{d}{dx}[p\,\partial_1 s] + \frac{1}{p}\frac{d^2}{dx^2}[p\,\partial_2 s]\Bigr)(q - p) \\
&= \int (\partial_0 s + L_0 s)\,(q - p).
\end{aligned}$$
It follows that

$$\int (\partial_0 s + L_0 s)\,(q - p) \ge 0 \tag{21}$$

for every $q \in \mathcal{P}$. We proceed to show that this is possible only if $\partial_0 s + L_0 s$
equals some constant $c_p$ almost everywhere, hence everywhere by continuity.
To this end, let $f = \partial_0 s + L_0 s$ and $g = f - \int f p$. Then $\int g q \ge 0$ for every $q \in \mathcal{P}$.
Suppose $g$ were not constant. Since $\int g p = 0$, the Lebesgue measure of the
(open) set $\{g < 0\}$ is strictly positive. Thus, there exists a probability density
$q_1 \in C^\infty$ with compact support such that $\int g q_1 < 0$. Then $q = \frac{1}{2}(q_1 + p) \in \mathcal{P}$
and $\int g q = \frac{1}{2}\int g q_1 < 0$, in contradiction to (21). Finally, the constant $c_p$ is
easily identified by integrating (19) against $p$ and noting that $\int (L_0 s)\,p = 0$,
by the vanishing argument. □

Equation (19) essentially is the Euler equation of the calculus of variations
[Gelfand and Fomin (1963), pages 40-42] and corresponds to equation (24)
of Parry, Dawid and Lauritzen (2012). Its slightly different form here results
from the fact that in our case the integrand of the functional to be
optimized is of the form $F(x, \ln y, (\ln y)', (\ln y)'')$ rather than of the common
form $F(x, y, y', y'')$.
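
The Euler equation (19) can be checked numerically for a concrete score. The Python sketch below does so for the log cosh score (32) of Example 5.3, for which $\partial_0 s = 0$, $\partial_1 s = (z_1 - 2 z_2 \tanh z_1)(1 - \tanh^2 z_1)$ and $\partial_2 s = 1 - \tanh^2 z_1$, so that the left-hand side of (19) should equal $c_p = 0$. The test density $p(x) \propto \exp(-x^4/4 - x^2/2)$, the finite-difference step and all function names are assumptions made for illustration; the normalizing constant of $p$ cancels in (19) and is therefore omitted.

```python
import math

def log_density_derivs(x):
    """z1 and z2 for the hypothetical test density ln p(x) = -x^4/4 - x^2/2 + const."""
    z1 = -x**3 - x
    z2 = -3.0 * x**2 - 1.0
    return z1, z2

def unnorm_density(x):
    """Unnormalized test density; the constant cancels in the Euler equation."""
    return math.exp(-x**4 / 4.0 - x**2 / 2.0)

def p_d1s(x):
    """p times the partial derivative of the log cosh score with respect to z1."""
    z1, z2 = log_density_derivs(x)
    sech2 = 1.0 / math.cosh(z1) ** 2
    return unnorm_density(x) * (z1 * sech2 - 2.0 * z2 * math.tanh(z1) * sech2)

def p_d2s(x):
    """p times the partial derivative of the log cosh score with respect to z2."""
    z1, _ = log_density_derivs(x)
    return unnorm_density(x) / math.cosh(z1) ** 2

def euler_residual(x, h=2e-4):
    """Left-hand side of (19), with d/dx and d^2/dx^2 replaced by central differences."""
    first = (p_d1s(x + h) - p_d1s(x - h)) / (2.0 * h)
    second = (p_d2s(x + h) - 2.0 * p_d2s(x) + p_d2s(x - h)) / h**2
    return (-first + second) / unnorm_density(x)  # the term d0 s is zero here

residuals = [euler_residual(x) for x in (-1.0, -0.3, 0.0, 0.4, 1.2)]
print(max(abs(r) for r in residuals) < 1e-3)
```

The residuals are zero only up to discretization error, so the tolerance has to be chosen in line with the step size `h`.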

As a first application of the Euler equation we show that local proper
scores are fixed points of the tangent construction. To this end, let $s \in \mathcal{R}_2$
be a local proper score of order 2. By Theorem 4.1 and Lemma 4.2, the
functional $\Phi_s : \mathcal{P} \to \mathbb{R}$ associated with the kernel $s$ is concave. The tangent
construction then gives

$$\tilde{s} = s + L_0 s + \partial_0 s - \int (\partial_0 s)\,p \tag{22}$$

on substituting $s$ for $K$ in Proposition 4.5. Initially, this is another proper
score, possibly of higher order, and possibly nonlocal. However, by Lemma 4.6
the right-hand side of (22) reduces to $s$, whence in fact $\tilde{s} = s$.

Proposition 4.7. For a local proper score $s \in \mathcal{R}_2$ the tangent construction
based on the (concave) functional $\Phi_s$ leads back to $s$. That is, any local
proper score of order 2 is a fixed point of the tangent construction.

4.3. Construction of a $z_2$-independent kernel. The vanishing argument
of Lemma 4.4 enables us to modify a given kernel without changing the
associated functional. This strategy is utilized in the following explicit
construction of a $z_2$-independent kernel from a given local proper score. It is
analogous to the idea of gauge choice developed in Sections 7.3 and 7.4 of
Parry, Dawid and Lauritzen (2012). Again, it is tacitly assumed that the
quantities $z_j$ refer to a fixed density $p \in \mathcal{P}$, that is, $z_j = z_j(x, p)$. As before,
we frequently suppress these quantities when they serve merely as arguments.

Proposition 4.8. Given the local proper score $s \in \mathcal{R}_2$, let the kernel $K$
be defined as

$$K = s - \frac{1}{p}\frac{d}{dx}[pV] = s - z_1 V - \frac{d}{dx}V, \tag{23}$$

where

$$V = \int_0^{z_1} \partial_2 s(x, z_0, t, z_2)\,dt. \tag{24}$$

Then $K \in \mathcal{R}_1$ and $\Phi_K = \Phi_s$. In particular, the score $s$ can be reconstructed
from the kernel $K$ via the tangent construction.


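Proof. The kernel $K$ inherits the polynomial growth properties from $s$,
and it is twice continuously differentiable since $s \in C^4$. In particular, $K$ is
well defined. An application of the vanishing argument to the term $\frac{1}{p}\frac{d}{dx}[pV]$
shows that $K$ and $s$ give rise to the same functional. Thus $\Phi_K = \Phi_s$, and
the last claim follows from Propositions 4.5 and 4.7. Therefore, to complete
the proof it remains to show that the kernel $K$ from (23) does not depend
on $z_2$, that is, we need to show that $\partial_2 K = 0$.
A comparison of the two differential operators

$$\begin{aligned}
\partial_2 \frac{d}{dx} &= \partial_2(\partial_x + z_1\partial_0 + z_2\partial_1 + z_3\partial_2) \\
&= \partial^2_{x2} + z_1\partial^2_{02} + \partial_1 + z_2\partial^2_{12} + z_3\partial^2_{22}
\end{aligned}$$

and

$$\frac{d}{dx}\partial_2 = \partial^2_{x2} + z_1\partial^2_{02} + z_2\partial^2_{12} + z_3\partial^2_{22}$$

yields the commutation relation

$$\partial_2 \frac{d}{dx} = \frac{d}{dx}\partial_2 + \partial_1. \tag{25}$$

Therefore, using $\partial_1 V = \partial_2 s$ [see (24)] we get

$$\partial_2 K = \partial_2 s - z_1\partial_2 V - \partial_1 V - \frac{d}{dx}\partial_2 V = -\Bigl(z_1 + \frac{d}{dx}\Bigr)\partial_2 V,$$

where

$$\partial_2 V = \int_0^{z_1} \partial^2_{22} s(x, z_0, t, z_2)\,dt. \tag{26}$$

Thus if

$$\partial^2_{22} s(x, z_0(x, p), t, z_2(x, p)) = 0 \quad \text{for all } x \in \mathbb{R},\ p \in \mathcal{P},\ |t| \le |z_1(x, p)|, \tag{27}$$

then $\partial_2 V = 0$ and hence $\partial_2 K = 0$, that is, $K \in \mathcal{R}_1$, as claimed. The somewhat
lengthy proof of (27) is given in the next subsection.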

4.4. Proof of the independence condition (27). The proof primarily rests 
upon the Euler equation (19). An evaluation of the total derivatives in (19) 
shows that the Euler equation can be written in the form 

$$z_0^{(4)} \cdot a(x, z_0, z_0', z_0'', z_0''') - b(x, z_0, z_0', z_0'', z_0''') = c_p \qquad (z_0 = \ln p) \tag{28}$$

with

$$a = \partial^2_{22} s \qquad [\text{so that in fact } a = a(x, z_0, z_0', z_0'')]$$

and a function $b$ which, apart from $x$, $z_0$ and the logarithmic derivatives
$z_1$, $z_2$ and $z_3 = z_2'$, depends only on the scoring function $s$ and its
partial derivatives up to order 3. Therefore, and because $s$ is of smoothness
class $C^4$, the function $b$ is continuously differentiable. The same holds for
$a = \partial^2_{22} s$, of course, so that $a, b \in C^1$.

A step critical to the remainder of the proof consists in showing that the
constant $c_p$ is independent of $p$. We state this below as Proposition 4.10;
its proof hinges on an argument due to Parry, Dawid and Lauritzen (2012).
The ensuing fact that one and the same equation, (28) with $c_p = c$, holds for
every $p \in \mathcal{P}$ is then utilized to complete the proof of (27). For each of these
steps it is important that the class $\mathcal{P}$ be sufficiently rich. Let

$$Z(x, q, k) = (z_0(x, q), z_1(x, q), \ldots, z_k(x, q))$$

for $k = 0, 1, \ldots$ with $z_j(x, q) = (\ln q)^{(j)}(x)$ as above.

Lemma 4.9 ($\mathcal{P}$-richness). Let $k \in \{0, 1, \ldots, 4\}$. (a) For every $x \in \mathbb{R}$ and
$y \in \mathbb{R}^{k+1}$ there exists $q \in \mathcal{P}$ such that $Z(x, q, k) = y$. (b) For every pair
$q_1, q_2 \in \mathcal{P}$ there exist $q \in \mathcal{P}$ and $x \in \mathbb{R}$ such that $Z(x, q, k) = Z(x, q_1, k)$ and
$q(u) = q_2(u)$ for $u$ outside some neighborhood of $x$.

Proof. This is fairly obvious from the definition of $\mathcal{P}$. For completeness,
we include a proof. As concerns part (a), let $x \in \mathbb{R}$ and $y \in \mathbb{R}^{k+1}$ be
fixed. There is some $p \in \mathcal{P}$ such that $z_0(x, p) = y_0$. We will construct $q$ as
a perturbation $q = p(1 + \psi)$ of $p$ such that $\psi(x) = 0$. Certainly $q \in \mathcal{P}$ if $\psi$
has compact support and is such that $1 + \psi > 0$, $\int \psi p = 0$, and $\psi \in C^\infty$. It
suffices to show that it is possible to prescribe arbitrary values for the first $k$
derivatives of $\psi$ at $x$, subject to those conditions. To that end, let $\psi = QM$
in the sense that $\psi(u) = Q(u)M(u)$ for $u \in \mathbb{R}$, where

$$Q(u) = \sum_{j=1}^{r} a_j (u - x)^j$$

is a polynomial vanishing at $x$ and $0 \le M \in C^\infty$ is a mollifier type function
with (small) compact support $S$ such that $M(x) = 1$ and $M^{(j)}(x) = 0$ for
$j = 1, \ldots, k$. Then $\psi^{(j)}(x) = Q^{(j)}(x)$ for $j = 1, \ldots, k$, thereby confirming that
arbitrary values can indeed be prescribed if $r \ge k$. By increasing $r$ if necessary,
one can further assume that $Q$ attains both positive and negative
values on $S$.

Let any such $Q$ be fixed. We show that the conditions $QM > -1$ and
$\int QMp = 0$ can be satisfied, too. In fact, since $Q(x) = 0$ one can modify $M$
such that $QM > -1$ everywhere without affecting its local behavior at $x$.
Since $\int_{\{Q<0\}} QMp < 0$ there is $\delta > 0$ such that the interval $J = [x - \delta, x + \delta]$
is contained in the interior of $S$ and $\int_J QMp + \int_{J^c \cap \{Q<0\}} QMp < 0$. Finally,
on the set $S \cap J^c \cap \{Q > 0\}$ one can modify $M$ such that

$$\int QMp = \int_J QMp + \int_{J^c \cap \{Q<0\}} QMp + \int_{J^c \cap \{Q>0\}} QMp = 0$$

without affecting the condition $QM > -1$. This concludes the proof of (a).
For part (b), note that because $q_1, q_2$ are continuous probability densities,
there is $x \in \mathbb{R}$ such that $q_1(x) = q_2(x)$. The above construction then yields
a local perturbation $q \in \mathcal{P}$ of $q_2$ satisfying $Z(x, q, k) = Z(x, q_1, k)$. □

Proposition 4.10. The constant $c_p$ in (28) does not depend on the
density $p$: there is a constant $c \in \mathbb{R}$ such that $c_p = c$ for every $p \in \mathcal{P}$.

Proof. We use an argument in Section 4 (around Condition 4.1) of
Parry, Dawid and Lauritzen (2012). Equation (28) [resp., (19)] can be condensed
to a statement of the form

$$F(x, Z(x, p, 4)) = c_p \qquad (x \in \mathbb{R},\ p \in \mathcal{P}), \tag{29}$$

where the function $F$ is determined by the score $s$ alone. Suppose that $c_p$
is not independent of $p$. Then there are $q_1, q_2 \in \mathcal{P}$ such that $c_{q_1} \ne c_{q_2}$. By
Lemma 4.9(b) there exist $p \in \mathcal{P}$ and $x_1 \ne x_2 \in \mathbb{R}$ such that $Z(x_1, p, 4) =
Z(x_1, q_1, 4)$ and $Z(x_2, p, 4) = Z(x_2, q_2, 4)$. By (29) it follows that for both
$j = 1$ and $j = 2$

$$c_{q_j} = F(x_j, Z(x_j, q_j, 4)) = F(x_j, Z(x_j, p, 4)) = c_p.$$

The contradiction implies that $c_p$ is indeed independent of $p$. □

The following lemma is an easy consequence of the uniqueness theorem 
for higher-order differential equations. 

Lemma 4.11 (Reduction principle). Let $k \in \{0, 1, 2, 3\}$, and let $a$ and $b$
be functions of arguments $x, y_0, \ldots, y_k$. Suppose that the function $z = z_0 =
\ln p(x)$, $x \in \mathbb{R}$, is, for every $p \in \mathcal{P}$, a solution of the differential equation

$$z^{(k+1)} \cdot a(x, z, \ldots, z^{(k)}) = b(x, z, \ldots, z^{(k)}). \tag{30}$$

Then $a(x, Z(x, p, k)) = 0$ for every $x \in \mathbb{R}$, $p \in \mathcal{P}$.

Proof. Fix $p \in \mathcal{P}$ and $x \in \mathbb{R}$, and suppose that $a(x, Z(x, p, k)) \ne 0$. Then
there is an open interval containing $x$ on which $a(\cdot, Z(\cdot, p, k))$ does not vanish
and $b(\cdot, Z(\cdot, p, k))/a(\cdot, Z(\cdot, p, k))$ is continuously differentiable. Therefore
$\ln p$ is, perhaps in a smaller neighborhood of $x$, the only solution to the
equation (30) whose derivatives up to order $k$ at $x$ are given by the components
of the vector $Z(x, p, k)$. On the other hand, by Lemma 4.9(a) there
exists $q \in \mathcal{P}$ such that $Z(x, q, k) = Z(x, p, k)$ but $z_{k+1}(x, q) \ne z_{k+1}(x, p)$. By
assumption $\ln q$ is a solution of (30), too, with the same initial conditions.
This contradiction to uniqueness is resolved only if $a(x, Z(x, p, k)) = 0$. Since
$p \in \mathcal{P}$ and $x \in \mathbb{R}$ were arbitrary, the proof of the lemma is complete. □

Let us combine these facts. Absorbing the (universal) constant $c_p = c$
in (28) into the function $b$, we see that every $p \in \mathcal{P}$ satisfies a differential
equation of the form (30) with $a = \partial^2_{22} s$. Therefore $\partial^2_{22} s(x, Z(x, p, 2)) = 0$ for
all $p \in \mathcal{P}$ and $x \in \mathbb{R}$ by the reduction principle. The proof of (27) is completed
on noting that for any $x \in \mathbb{R}$, $p \in \mathcal{P}$, $|t| \le |z_1(x, p)|$ there is a $q \in \mathcal{P}$ such that
$Z(x, q, 2) = (z_0(x, p), t, z_2(x, p))$, by Lemma 4.9(a).

4.5. Linear dependence on $z_0$. The fact that a local proper score $s \in \mathcal{R}_2$
can be represented by means of a $z_2$-independent kernel $K \in \mathcal{R}_1$ will now
be utilized to show that both $s$ and $K$ depend linearly on the logarithmic
score, $z_0$.

Proposition 4.12 (Linearity in $z_0$). Let $K \in \mathcal{R}_1$ be the kernel constructed
in Section 4.3 from a given local proper score $s \in \mathcal{R}_2$. Then $K$ is of
the form $K(x, z_0, z_1) = cz_0 + K_0(x, z_1)$ where $c$ is a real constant, and $s$ is of
the form (5) with the same $c$.

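Proof. We already know that the score $s$ can be represented as in (18),
with $K$ not depending on $z_2$. Furthermore, by the Euler equation (19) and
Proposition 4.10 there is some constant $c$ such that

$$\partial_0 s - c = \frac{1}{p}\frac{d}{dx}\Bigl(p\,\partial_1 s - \frac{d}{dx}[p\,\partial_2 s]\Bigr). \tag{31}$$

Using these facts along with the commutation relation $\partial_1 \frac{d}{dx} = \frac{d}{dx}\partial_1 + \partial_0$
[cf. (25)], we get

$$\begin{aligned}
\partial_1 s &= \partial_1 K - \partial_1 K - z_1\partial^2_{11}K - \partial_1\Bigl(\frac{d}{dx}[\partial_1 K]\Bigr) + \partial^2_{01}K \\
&= -z_1\partial^2_{11}K - \frac{d}{dx}[\partial^2_{11}K] - \partial^2_{01}K + \partial^2_{01}K \\
&= -z_1\partial^2_{11}K - \frac{d}{dx}[\partial^2_{11}K].
\end{aligned}$$

On the other hand, we have

$$\partial_2 s = -\partial_2 \frac{d}{dx}\partial_1 K = -\partial_2(\partial^2_{x1}K + \partial^2_{01}K \cdot z_1 + \partial^2_{11}K \cdot z_2) = -\partial^2_{11}K,$$

and hence

$$\partial_1 s = z_1\partial_2 s + \frac{d}{dx}[\partial_2 s].$$

Thus $p\,\partial_1 s = \frac{d}{dx}[p\,\partial_2 s]$, the right-hand side of (31) vanishes, and $\partial_0 s$ is constant,
$\partial_0 s = c$. It follows that $s = cz_0 + g(x, z_1, z_2)$ for some function $g$ independent
of $z_0$, and it remains to verify the particular forms of $K$ and $s$.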

By (23) and the special form of $s$ we have $K - cz_0 = g - z_1 V - \frac{d}{dx}V$,
where now $V = \int_0^{z_1} \partial_2 g(x, t, z_2)\,dt$. But $\partial_2 V = 0$, by (26) and (27), and clearly
$\partial_2(K - cz_0) = 0$, since $K \in \mathcal{R}_1$. Therefore $K_0 = g - z_1 V - \frac{d}{dx}V$ does not
depend on $z_2$, and it also does not depend on $z_0$ (since neither $g$ nor $V$
depend on $z_0$), so $K_0 = K_0(x, z_1)$. This completes the proof of the first
claim. The tangent construction based on $K = cz_0 + K_0(x, z_1)$ then implies,
upon observing $\partial_0 K - \int (\partial_0 K)\,p = 0$, that

$$\begin{aligned}
s &= K - z_1\partial_1 K - \frac{d}{dx}[\partial_1 K] \\
&= cz_0 + K_0 - z_1\partial_1 K_0 - \partial^2_{x1}K_0 - z_2\partial^2_{11}K_0,
\end{aligned}$$

which is the desired representation. □

4.6. Completion of the proof of Theorem 3.2. The tangent construction 
based on a concave functional $\Phi_K$ with $K \in \mathcal{R}_1$ yields a proper score, which
is of the form (5) if the kernel K is of the form (4). This proves part (a). 
Part (b) follows from Propositions 4.8 and 4.12. Finally, part (c) is immediate 
from Theorem 4.1. 



5. Remaining proofs, supplements and examples. 



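5.1. Proof of Proposition 3.3. Initially, suppose that $K \in \mathcal{R}_1$ does not
depend on $y_0$, so that $K = K(x, y_1)$, and is concave in $y_1$ for every fixed $x$.
Given $p_0, p_1 \in \mathcal{P}$ and $t \in [0, 1]$, let $p_t = tp_1 + (1 - t)p_0$ and put $\alpha = tp_1/p_t$,
pointwise for every $x \in \mathbb{R}$. Then $p_t'/p_t = \alpha p_1'/p_1 + (1 - \alpha)p_0'/p_0$, whence

$$K(\cdot, p_t'/p_t) \ge \alpha K(\cdot, p_1'/p_1) + (1 - \alpha)K(\cdot, p_0'/p_0)$$

and so

$$\begin{aligned}
\Phi_K(p_t) &= \int K\Bigl(x, \frac{p_t'}{p_t}(x)\Bigr) p_t(x)\,dx \\
&\ge \int \Bigl[\alpha(x) K\Bigl(x, \frac{p_1'}{p_1}(x)\Bigr) + (1 - \alpha(x)) K\Bigl(x, \frac{p_0'}{p_0}(x)\Bigr)\Bigr] p_t(x)\,dx \\
&= \int K\Bigl(x, \frac{p_1'}{p_1}(x)\Bigr) t p_1(x)\,dx + \int K\Bigl(x, \frac{p_0'}{p_0}(x)\Bigr)(1 - t) p_0(x)\,dx \\
&= t\Phi_K(p_1) + (1 - t)\Phi_K(p_0).
\end{aligned}$$

The general case follows by the strict concavity of the entropy functional
$p \mapsto -\int p \ln p$. Concerning the claim about strict propriety, the pathology
described in Remark 3.9 does not occur within the class $\mathcal{P}$, because all densities
$p \in \mathcal{P}$ are strictly positive. Thus, the primitive of $p'/p$ exists throughout
$\mathbb{R}$ and equals $\ln p$ up to a constant, so that $p'/p = q'/q$ implies $p = q$. □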
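
The concavity inequality just derived can be illustrated numerically. The Python sketch below takes the concave kernel $K = -y_1^2$, for which $\Phi_K(p) = -\int (p')^2/p$, and compares $\Phi_K(p_t)$ with $t\Phi_K(p_1) + (1 - t)\Phi_K(p_0)$ for two Gaussian densities. The specific components, the mixing weight and the integration grid are hypothetical choices made for illustration only.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_pdf_prime(x, mu, sigma):
    """Derivative of the normal density with respect to x."""
    return -(x - mu) / sigma**2 * normal_pdf(x, mu, sigma)

def phi_K(t, comp0, comp1, a=-20.0, b=20.0, n=40000):
    """Phi_K(p_t) for K = -y1^2 and p_t = t*p1 + (1-t)*p0, by the trapezoid rule.

    comp0, comp1 are (mu, sigma) pairs of the two Gaussian components.
    """
    def integrand(x):
        p = t * normal_pdf(x, *comp1) + (1.0 - t) * normal_pdf(x, *comp0)
        dp = t * normal_pdf_prime(x, *comp1) + (1.0 - t) * normal_pdf_prime(x, *comp0)
        # K(x, p'/p) * p = -(p')^2 / p; guard against underflow far in the tails.
        return -dp * dp / p if p > 0.0 else 0.0

    h = (b - a) / n
    total = 0.5 * (integrand(a) + integrand(b))
    for i in range(1, n):
        total += integrand(a + i * h)
    return total * h

comp0, comp1, t = (-1.0, 1.0), (2.0, 1.5), 0.3   # hypothetical example values
lhs = phi_K(t, comp0, comp1)
rhs = t * phi_K(1.0, comp0, comp1) + (1.0 - t) * phi_K(0.0, comp0, comp1)
print(lhs >= rhs)
```

For a single Gaussian component the functional equals $-1/\sigma^2$, which provides a simple sanity check on the quadrature.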

5.2. Local proper scoring rules of order 1. The representation (5) suggests
that local proper scores of exact order $k = 1$ do not exist. In fact,
Parry, Dawid and Lauritzen (2012) show that there are no key local score
functions of odd order. Within our framework, we can prove the following.

Proposition 5.1. Any local score $s \in \mathcal{R}_1$ that is proper relative to $\mathcal{P}$ is
of the form $s = cz_0 + k(x)$ for some $c \le 0$.

Proof. Suppose that $s \in \mathcal{R}_1$ is proper. The Euler equation reduces to

$$\partial_0 s - \frac{1}{p}\frac{d}{dx}[p\,\partial_1 s] = \partial_0 s - z_1\partial_1 s - \partial^2_{x1}s - z_1\partial^2_{01}s - z_2\partial^2_{11}s = c_p \qquad (s = s_p)$$

in this case. Arguing as in Section 4.4, we find that $c_p = c$ is independent of $p$
and that $\partial^2_{11}s$ vanishes on $\mathbb{R}^3$. Therefore there are functions $g, h$ depending
only on $x, z_0$ such that $s = z_1 g + h$. Plugging this representation into the
Euler equation gives

$$c = z_1\partial_0 g + \partial_0 h - z_1 g - \partial_x g - z_1\partial_0 g = -z_1 g - \partial_x g + \partial_0 h,$$

whence $g = 0$ by another application of the reduction principle. Thus $\partial_0 h = c$,
which means that $s = cz_0 + k(x)$. Since $-z_0$ represents the logarithmic score,
$s$ can be proper only if $c \le 0$. □

5.3. Examples. In the subsequent examples, we keep the notation to
a minimum and suppress arguments whenever possible.

Example 5.2. For $n \ge 2$ even and $c \le 0$, let $K = cz_0 - z_1^n$. Then $K \in \mathcal{R}_1$,
the functional $\Phi_K$ is strictly concave on $\mathcal{P}$, and the tangent construction of
Proposition 4.5 yields the score

$$\begin{aligned}
s &= cz_0 - z_1^n + nz_1^n + n(n - 1)z_1^{n-2}z_2 + c - c \\
&= cz_0 + (n - 1)(z_1^n + nz_1^{n-2}z_2),
\end{aligned}$$

which is local of order 2 and strictly proper relative to $\mathcal{P}$.

Conversely, if $s$ is as above, let us carry out the construction of the associated
kernel $K$ described in Section 4.3. We set

$$V = \int_0^{z_1} n(n - 1)t^{n-2}\,dt = nz_1^{n-1},$$

so that

$$K = s - z_1 V - \frac{d}{dx}V = cz_0 - z_1^n.$$

The construction indeed recovers the kernel $K$ from the score $s$.

Example 5.3. The special case $K = -z_1^2$ in the previous example gives
the Hyvärinen score, $s = z_1^2 + 2z_2$. Being quadratic in the log-likelihood
derivative, $z_1 = p'/p$, and linear in the second derivative, $z_2 = p''/p - (p'/p)^2$,
this score generally is sensitive to outliers. For example, within the Gaussian
shift-scale family with mean $\mu$ and variance $\sigma^2$, the Hyvärinen score reduces
to $s = (x - \mu)^2/\sigma^4 - 2/\sigma^2$.
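
The round trip of Example 5.2 and the Gaussian formula of Example 5.3 can be verified with a few lines of arithmetic. In the Python sketch below, the function names and the numerical inputs are hypothetical; the sketch uses $V = nz_1^{n-1}$ and its total derivative $\frac{d}{dx}V = n(n-1)z_1^{n-2}z_2$.

```python
def power_score(n, c, z0, z1, z2):
    """Score of Example 5.2: c*z0 + (n-1)*(z1^n + n*z1^(n-2)*z2)."""
    return c * z0 + (n - 1) * (z1**n + n * z1 ** (n - 2) * z2)

def recovered_kernel(n, c, z0, z1, z2):
    """Kernel K = s - z1*V - dV/dx of Section 4.3 for the score of Example 5.2."""
    V = n * z1 ** (n - 1)                       # integral of d2 s over [0, z1]
    dV_dx = n * (n - 1) * z1 ** (n - 2) * z2    # total derivative of V
    return power_score(n, c, z0, z1, z2) - z1 * V - dV_dx

def hyvarinen_gaussian(x, mu, sigma):
    """Hyvarinen score (n = 2, c = 0) of the N(mu, sigma^2) density at x."""
    z1 = -(x - mu) / sigma**2
    z2 = -1.0 / sigma**2
    return power_score(2, 0.0, 0.0, z1, z2)

# Hypothetical inputs: the recovered kernel should equal c*z0 - z1^n.
n, c, z0, z1, z2 = 4, -1.0, -1.2, 0.5, -0.8
print(abs(recovered_kernel(n, c, z0, z1, z2) - (c * z0 - z1**n)) < 1e-12)

# Closed form (x - mu)^2 / sigma^4 - 2 / sigma^2 for the Gaussian family.
print(abs(hyvarinen_gaussian(1.0, 0.0, 2.0) - (1.0 / 16.0 - 0.5)) < 1e-12)
```

The quadratic growth of the Gaussian Hyvärinen score in $x - \mu$ is what makes it sensitive to outliers.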

As an alternative, let us consider the kernel $K = -\ln\cosh z_1$, which grows
only linearly as $z_1$ becomes large. The corresponding score

$$s = -\ln\cosh z_1 + z_1\tanh z_1 + z_2(1 - \tanh^2 z_1) \tag{32}$$

appears to be more robust, because as $|y| \to \infty$,

$$y\tanh y - \ln\cosh y \to \ln 2,$$

and the factor of $z_2$ tends to zero exponentially, in that $1 - \tanh^2 y \sim
4\exp(-2|y|)$. Of course, the log cosh score (32) is strictly proper relative
to $\mathcal{P}$, since $K$ is strictly concave.
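
The boundedness of the log cosh score can be seen numerically. The following Python sketch evaluates (32) for a Gaussian predictive density; the helper names are assumptions, and $\ln\cosh y$ is computed through the identity $\ln\cosh y = |y| + \ln(1 + e^{-2|y|}) - \ln 2$ to avoid overflow for large $|y|$.

```python
import math

def ln_cosh(y):
    """Overflow-safe ln cosh y, via |y| + ln(1 + exp(-2|y|)) - ln 2."""
    a = abs(y)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def log_cosh_score(z1, z2):
    """Log cosh score (32) as a function of z1 = (ln p)' and z2 = (ln p)''."""
    t = math.tanh(z1)
    return -ln_cosh(z1) + z1 * t + z2 * (1.0 - t * t)

def log_cosh_score_gaussian(x, mu, sigma):
    """Log cosh score of the N(mu, sigma^2) density at the observation x."""
    return log_cosh_score(-(x - mu) / sigma**2, -1.0 / sigma**2)

# For an extreme outlier the score approaches ln 2 rather than growing quadratically.
outlier_score = log_cosh_score_gaussian(50.0, 0.0, 1.0)
print(abs(outlier_score - math.log(2.0)) < 1e-9)
```

For comparison, the Hyvärinen score of the same observation under the same density is $50^2 - 2 = 2498$.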

6. Data example: Probabilistic weather forecasting. The data example 
in this section illustrates the use of local and nonlocal scoring rules in an 
applied forecasting problem. 

Weather forecasting has traditionally been viewed as a deterministic en- 
terprise that draws on highly sophisticated, numerical models of the atmo- 
sphere. The advent of ensemble prediction systems in the early 1990s marks 
a change of paradigms toward probabilistic forecasting [Palmer (2002), Gneit- 
ing and Raftery (2005)]. An ensemble prediction system consists of multi- 
ple runs of numerical weather prediction models, which differ in the initial 
conditions and/or the mathematical representation of the atmosphere. As 
ensemble forecasts are subject to dispersion errors and biases, some form of 
statistical postprocessing is required, for a happy marriage of mechanistic 
and statistical modeling. 

Wilks and Hamill (2007) and Brocker and Smith (2008) review statistical 
postprocessing techniques for ensemble weather forecasts. State-of-the-art 
methods include the Bayesian model averaging (BMA) approach developed 
by Raftery et al. (2005) and Sloughter et al. (2007), Sloughter, Gneiting and 
Raftery (2010), and the heterogeneous regression, or ensemble model output 
statistics (EMOS), technique of Gneiting et al. (2005) and Thorarinsdottir 
and Gneiting (2010). The BMA approach employs a mixture distribution, 
where each mixture component is a parametric probability density associ- 
ated with an individual ensemble member, with the mixture weight reflect- 
ing the member's relative contributions to predictive skill over a training 
period. In contrast, the EMOS predictive distribution is a single parametric
distribution.

For concreteness, consider an ensemble of point forecasts, $f_1, \ldots, f_k$, for
surface temperature, $x$, at a given time and location. The goal is to fit predictive
distributions that are as sharp as possible, subject to them being
calibrated [Gneiting, Balabdaoui and Raftery (2007)]. Let $\phi(x; \mu, \sigma^2)$ denote
the normal density with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 > 0$ evaluated at
$x \in \mathbb{R}$. The BMA approach of Raftery et al. (2005) employs Gaussian components
with a linearly bias-corrected mean. The BMA predictive density
for temperature then becomes

$$q(x \mid f_1, \ldots, f_k) = \sum_{i=1}^{k} w_i\,\phi(x; a_i + b_i f_i, \sigma^2)$$

with BMA weights, $w_1, \ldots, w_k$, that are nonnegative and sum to 1, bias
parameters $a_1, \ldots, a_k$ and $b_1, \ldots, b_k$, and a common variance parameter, $\sigma^2$.
The EMOS approach of Gneiting et al. (2005) employs a single Gaussian
predictive density, in that

$$q(x \mid f_1, \ldots, f_k) = \phi(x; a + b_1 f_1 + \cdots + b_k f_k, c + ds^2)$$

with regression parameters $a$ and $b_1, \ldots, b_k$, and spread parameters $c$ and $d$,
where $s^2$ is the variance of the ensemble values. The EMOS technique thus
is more parsimonious, while the BMA method is more flexible.

Following the original development in Raftery et al. (2005) and Gneiting
et al. (2005), we apply the BMA and EMOS methods to the five-member
University of Washington Mesoscale Ensemble over the North American Pacific
Northwest [Grimit and Mass (2002)], at a prediction horizon of 48 hours.
Here we compare the predictive performance of the BMA and EMOS density
forecasts for surface temperature verifying in the period of 24 April to
30 June 2000, which is the largest period common to those used by Raftery
et al. (2005) and Gneiting et al. (2005). The predictive models were fitted on
trailing training periods of length 25 days for BMA and length 40 days for
EMOS, as recommended and described in the aforementioned papers. Overall,
there were 23,691 individual forecast cases at individual meteorological
stations and valid times, when aggregated temporally and spatially over the
test period and the Pacific Northwest, comprising the states of Washington,
Oregon and Idaho, and the southern part of the Canadian province of
British Columbia. All scores reported are averaged over the 23,691 forecast
cases.

In Table 1 we assess these forecasts, by computing the mean score under
various local proper scoring rules, namely the logarithmic score (LS), the
Hyvärinen score (HS) and the log cosh score (LCS) introduced in (32). In
addition, we consider two popular nonlocal scores, namely the quadratic
score (QS) and the spherical score (SphS), defined as

$$\mathrm{QS}(x, q) = \|q\|_2^2 - 2q(x) \quad \text{and} \quad \mathrm{SphS}(x, q) = -\frac{q(x)}{\|q\|_2},$$

respectively, where $\|\cdot\|_2$ denotes the $L_2$-norm. These scores are strictly
proper relative to the class of the probability measures with square-integrable
Lebesgue densities [Matheson and Winkler (1976), Gneiting and Raftery
(2007)].

Table 1
Mean logarithmic score (LS), Hyvärinen score (HS), log cosh score (LCS), quadratic
score (QS) and spherical score (SphS) for statistically postprocessed ensemble forecasts of
surface temperature over the North American Pacific Northwest in April-June 2000,
using Bayesian model averaging (BMA) and ensemble model output statistics (EMOS),
respectively. See the text for details

Scoring rule    LS       HS        LCS        QS        SphS
BMA             2.502    -0.113    -0.0572    -0.101    -0.319
EMOS            2.486    -0.118    -0.0595    -0.103    -0.321
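
For a single Gaussian predictive density, as in the EMOS approach, all five scores are available in closed form, since $\|q\|_2^2 = 1/(2\sigma\sqrt{\pi})$ for the $N(\mu, \sigma^2)$ density. The Python sketch below is a minimal illustration under stated assumptions: the function names, the parameter values and the ensemble are made up, and the ensemble variance is computed with divisor $k$, which may differ from the convention used in the original EMOS implementation.

```python
import math

def emos_parameters(ensemble, a, b, c, d):
    """Mean a + sum(b_i * f_i) and standard deviation sqrt(c + d * s^2) of the EMOS density.

    Assumption: s^2 is the ensemble variance with divisor k.
    """
    k = len(ensemble)
    mean_f = sum(ensemble) / k
    s2 = sum((f - mean_f) ** 2 for f in ensemble) / k
    mu = a + sum(bi * fi for bi, fi in zip(b, ensemble))
    return mu, math.sqrt(c + d * s2)

def gaussian_scores(x, mu, sigma):
    """Negatively oriented scores of the N(mu, sigma^2) density at the observation x."""
    u = (x - mu) / sigma
    q_x = math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))
    z1 = -(x - mu) / sigma**2                             # (ln q)'(x)
    z2 = -1.0 / sigma**2                                  # (ln q)''(x)
    norm_sq = 1.0 / (2.0 * sigma * math.sqrt(math.pi))    # squared L2 norm of q
    t = math.tanh(z1)
    ln_cosh = abs(z1) + math.log1p(math.exp(-2.0 * abs(z1))) - math.log(2.0)
    return {
        "LS": 0.5 * math.log(2.0 * math.pi * sigma**2) + 0.5 * u * u,
        "HS": z1**2 + 2.0 * z2,
        "LCS": -ln_cosh + z1 * t + z2 * (1.0 - t * t),
        "QS": norm_sq - 2.0 * q_x,
        "SphS": -q_x / math.sqrt(norm_sq),
    }

# Hypothetical three-member ensemble and EMOS coefficients, for illustration only.
mu, sigma = emos_parameters([1.0, 2.0, 3.0], 0.5, [0.2, 0.3, 0.5], 1.0, 0.5)
scores = gaussian_scores(2.5, mu, sigma)
```

Averaging such dictionaries over all forecast cases would give mean scores of the kind reported in Table 1; the BMA mixture density requires the corresponding mixture formulas for $z_1$, $z_2$ and $\|q\|_2$, which are not covered here.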

Under all scoring rules, the EMOS technique shows a slightly lower (i.e., 
better) mean score than the BMA method. However, the differences pale 
when compared to those between the unprocessed ensemble forecast and the 
statistically postprocessed density forecasts. The unprocessed five-member 
ensemble gives a discrete predictive distribution, namely the empirical measure
in $f_1, \ldots, f_5$, to which the above scores do not apply directly. However,
we can compute the mean score for a smoothed ensemble forecast, which 
we take to be normal, with the first two moments identical to those of the 
empirical measure. Under this natural approach, the mean scores for the 
smoothed ensemble forecast are very high, reaching 21.4 for the logarithmic 
score, 1.14 x 10^ for the Hyvärinen score, 0.230 for the log cosh score, 0.194
for the quadratic score, and —0.217 for the spherical score, thereby attesting 
to the benefits of statistical postprocessing. 

7. Discussion. A scoring rule on the real line is local of order k if the 
score depends on the predictive density only through its value, and the values 
of its derivatives of order up to k, at the realizing event. It is proper if the 
expected score is minimized whenever the predictive density coincides with 
the density underlying the realizing event. Supplementing the fundamental 
work in the recent paper by Parry, Dawid and Lauritzen (2012), we have 
elaborated a suitable framework for a formal characterization of the local 
proper scoring rules in the particular, but most relevant, case of order $k \le 2$.

A practically useful characterization depends on the judicious choice of
a class $\mathcal{S}$ of scoring functions, and a class $\mathcal{D}$ of predictive densities, within
which scores and densities may vary freely. Involved therein is a delicate
trade-off, in that narrow classes $\mathcal{D}$ allow for weak assumptions on the members
of $\mathcal{S}$, but have little, if any, practical relevance. Our choice of $\mathcal{S}$ (the
class $\mathcal{R}_2$ of scoring functions growing at most polynomially at infinity) and
of $\mathcal{D}$ (the class $\mathcal{P}$ of densities decaying faster than the reciprocal of any
polynomial, with log-likelihood derivatives growing at most polynomially)
appears to be usefully general and to achieve a reasonable balance. The balance
could easily be shifted, for example, in favor of more heavy-tailed densities,
by adapting the polynomial growth order in $\mathcal{S}$.

Counterexamples show that proper scoring rules of practical interest, such
as the Hyvärinen score (2), may no longer be strictly proper relative to any
class $\mathcal{D}$ that contains a convex family of densities with a single common
zero. It is thus natural to assume that all densities in $\mathcal{D}$ are strictly positive
on their common support, $\Omega$, which then is an interval. The case of finite
boundary points, for example, when $\Omega = (0, \infty)$, appears to be tractable
similarly to the case $\Omega = \mathbb{R}$ considered here, resulting in essentially the
same characterization. It suffices to impose suitable boundary conditions at
$x = 0$ on the classes $\mathcal{S}$ and $\mathcal{D}$, guaranteeing the existence of integrals and
causing the boundary terms in the proof of Lemma 4.2 to vanish.

With the resurgence of interest in probabilistic forecasting [Gneiting (2008)], 
scoring rules for density forecasts are in increasing demand. In this context, 
locality is an appealing property, which we have studied in this work. A dif- 
ferent argument posits that a scoring rule for probabilistic forecasts ought 
to be sensitive to distance, in the sense that it rewards the assignment of 
greater mass not just to exactly the event or value that is observed, but also 
to nearby events [Stael von Holstein (1969), Jose, Nau and Winkler (2009)]. 
While either approach has appeal, locality and sensitivity to distance appear 
to be mutually exclusive properties, and it is not clear which one is more 
compelling [Mason (2008), Winkler and Jose (2008)]. However, in our me- 
teorological data example as well as in other experience, local and nonlocal 
proper scoring rules generally yield comparable results. 

In addition to their use in the assessment of predictive performance,
proper scoring rules play major roles in the theory and practice of estimation
[Dawid (2007), Gneiting and Raftery (2007)]. A striking aspect is that local
proper scoring rules of order $k \ge 2$ allow for statistical inference without
knowledge of normalization constants [Parry, Dawid and Lauritzen (2012)].
Indeed, this was the motivation for the initial development by Hyvärinen
(2005). The example of the log cosh score (32) shows that local scores can
be less nonrobust than one might expect. These facets suggest exciting opportunities
and novel prospects particularly in complex settings. Undoubtedly,
the pioneering work of Hyvärinen (2005, 2007), Dawid and Lauritzen
(2005) and Parry, Dawid and Lauritzen (2012) has laid the groundwork for
a wide range of promising future work, both theoretically and methodologically,
and including discrete and multivariate settings [Dawid, Lauritzen and
Parry (2012), Ehm (2011)], where the tangent approach may continue to be
useful and provide new insight.